Research Synthesis Methods — Latest Matching Preprints

1

JARVIS, should this study be selected for full-text screening? Performance of a Joint AI-ReViewer Interactive Screening tool for systematic reviews

Barreto, G. H. C.; Burke, C.; Davies, P.; Halicka, M.; Paterson, C.; Swinton, P.; Saunders, B.; Higgins, J. P. T.

2026-04-11 health informatics 10.64898/2026.04.08.26350384 medRxiv

Top 0.1%

19.2%

Show abstract

BackgroundSystematic reviews are essential for evidence-based decision making in health sciences but require substantial time and resource for manual processes, particularly title and abstract screening. Recent advances in machine learning and large language models (LLMs) have demonstrated promise in accelerating screening with high recall but are often limited by modest gains in efficiency, mostly due to the absence of a generalisable stopping criterion. Here, we introduce and report preliminary findings on the performance of a novel semi-automated active learning system, JARVIS, that integrates LLM-based reasoning using the PICOS framework, neural networks-based classification, and human decision-making to facilitate abstract screening. MethodsDatasets containing author-made inclusion and exclusion decisions from six published systematic reviews were used to pilot the semi-automated screening system. Model performance was evaluated across recall, specificity and area under the curve precision-recall (AUC-PR), using full-text inclusion as the ground truth. Estimated workload and financial savings were calculated by comparing total screening time and reviewer costs across manual and semi-automated scenarios. ResultsAcross the six review datasets, recall ranged between 98.2% and 100%, and specificity ranged between 97.9% and 99.2% at the defined stopping point. Across iterations, AUC-PR values ranged between 83.8% and 100%. Compared with human-only screening, JARVIS delivered workload savings between 71.0% and 93.6%. When a single reviewer read the excluded records, workload savings ranged between 35.6 % and 46.8%. ConclusionThe proposed semi-automated system substantially reduced reviewer workload while maintaining high recall, improving on previously reported approaches. Further validation in larger and more varied reviews, as well as prospective testing, is warranted.

2

Cochrane Evaluation of (Semi-) Automated Review (CESAR) Methods: Protocol for an adaptive platform study within reviews

Gartlehner, G.; Banda, S.; Callaghan, M.; Chase, J.-A.; Dobrescu, A.; Eisele-Metzger, A.; Flemyng, E.; Gardner, S.; Griebler, U.; Helfer, B.; Jemiolo, P.; Macura, B.; Minx, J. C.; Noel-Storr, A.; Rajabzadeh Tahmasebi, N.; Sharifan, A.; Meerpohl, J.; Thomas, J.

2026-04-15 health informatics 10.64898/2026.04.13.26350802 medRxiv

Top 0.1%

18.3%

Show abstract

Background: Artificial intelligence (AI) has the potential to improve the efficiency of evidence synthesis and reduce human error. However, robust methods for evaluating rapidly evolving AI tools within the practical workflows of evidence synthesis remain underdeveloped. This protocol describes a study design for assessing the effectiveness, efficiency, and usability of AI tools in comparison to traditional human-only workflows in the context of Cochrane systematic reviews. Methods: Members of the Cochrane Evaluation of (Semi-) Automated Review (CESAR) Methods Project developed an adaptive platform study-within-a-review (SWAR) design, modeled after clinical platform trials. This design employs a master protocol to concurrently evaluate multiple AI tools (interventions) against a standard human-only process (control) across three key review tasks: title and abstract screening, full-text screening, and data extraction. The adaptive framework allows for the addition or removal of AI tools based on interim performance analyses without necessitating a restart of the study. Performance will be assessed using metrics such as accuracy (sensitivity, specificity, precision), efficiency (time on task), response stability, impact of errors, and usability, in alignment with Responsible use of AI in evidence SynthEsis (RAISE) principles. Results: The study will generate comparative data about the performance and usability of specific AI tools employed in a semi- or fully automated manner relative to standard human effort. The protocol provides a flexible framework for the assessment of AI tools in evidence synthesis, addressing the limitations of static, one-time evaluations. Discussion: This study protocol presents a novel methodological approach to addressing the challenges of evaluating AI tools for evidence syntheses. By validating entire workflows rather than individual technologies, the findings will establish an evidence base for determining the viability of integrating AI into evidence-synthesis workflows. The adaptive design of this study is flexible and can be adopted by other investigators, ensuring that the evaluation framework remains relevant as new tools emerge.

3

Protocol for LLM-Generated CONSORT Report for Increased Reporting: A Parallel-Arm Randomized Controlled Trial (Protocol)

Krauska, A. N.; Rohe, K.

2026-04-17 health policy 10.64898/2026.04.15.26350926 medRxiv

Top 0.1%

10.0%

Show abstract

Background Randomized controlled trials (RCTs) often have incomplete methods reporting despite widespread adoption of the CONSORT guideline. The editorial process is supposed to detect these shortcomings and request clarifications from authors, which is time-consuming. We developed an LLM-based CONSORT Rohe Nordberg Report that highlights which CONSORT items appear fully or partially reported and checks page references claimed by authors, and then creates follow up questions for authors to more easily correct missing information. Methods This parallel-arm, superiority RCT will randomize eligible RCT submissions (after desk screening) 1:1 into intervention (editorial team and authors receive the Rohe Nordberg Report) or control (standard editorial review only). The primary outcome is whether manuscripts improve their reporting of CONSORT items in the Methods and Results sections between the original submission and first revision. This will be assessed by blinded human reviewers who evaluate the textual changes for improvements between the original and revised manuscripts for each relevant CONSORT item. Secondary outcomes include time to editorial decisions, rejection and non-resubmission rates, if authors can correctly identify where CONSORT items are reported, and extent of revisions. Human evaluators will be blinded to whether the manuscript was in the intervention or control group. Discussion By providing authors and the editorial team with specific follow up questions for each underreported CONSORT item, we hypothesize that basic underreporting will be more efficiently detected and corrected. Using blinded human reviewers as the primary outcome assessors ensures a rigorous, unbiased evaluation. If successful, this approach may help align manuscripts more closely with CONSORT standards, ultimately benefiting evidence synthesis.

4

Challenges in the Computational Reproducibility of Linear Regression Analyses: An Empirical Study

Jones, L. V.; Barnett, A.; Hartel, G.; Vagenas, D.

2026-04-07 health systems and quality improvement 10.64898/2026.04.07.26350286 medRxiv

Top 0.1%

9.4%

Show abstract

Background: Reproducibility concerns in health research have grown, as many published results fail to be independently reproduced. Achieving computational reproducibility, where others can replicate the same results using the same methods, requires transparent reporting of statistical tests, models, and software use. While data-sharing initiatives have improved accessibility, the actual usability of shared data for reproducing research findings remains underexplored. Addressing this gap is crucial for advancing open science and ensuring that shared data meaningfully support reproducibility and enable collaboration, thereby strengthening evidence-based policy and practice. Methods: A random sample of 95 PLOS ONE health research papers from 2019 reporting linear regression was assessed for data-sharing practices and computational reproducibility. Data were accessible for 43 papers. From the randomly selected sample, the first 20 papers with available data were assessed for computational reproducibility. Three regression models per paper were reanalysed. Results: Of the 95 papers, 68 reported having data available, but 25 of these lacked the data required to reproduce the linear regression models. Only eight of 20 papers we analysed were computationally reproducible. A major barrier to reproducing the analyses was the great difficulty in matching the variables described in the paper to those in the data. Papers sometimes failed to be reproduced because the methods were not adequately described, including variable adjustments and data exclusions. Conclusion: More than half (60%) of analysed studies were not computationally reproducible, raising concerns about the credibility of the reported results and highlighting the need for greater transparency and rigour in research reporting. When data are made available, authors should provide a corresponding data dictionary with variable labels that match those used in the paper. Analysis code, model specifications, and any supporting materials detailing the steps required to reproduce the results should be deposited in a publicly accessible repository or included as supplementary files. To increase the reproducibility of statistical results, we propose a Model Location and Specification Table (MLast), which tracks where and what analyses were performed. In conjunction with a data dictionary, MLast enables the mapping of analyses, greatly aiding computational reproducibility.

5

An Empirical Assessment of Inferential Reproducibility of Linear Regression in Health and Biomedical Research Papers

Jones, L.; Barnett, A.; Hartel, G.; Vagenas, D.

2026-04-07 health systems and quality improvement 10.64898/2026.04.07.26350296 medRxiv

Top 0.1%

4.9%

Show abstract

Background: In health research, variability in modelling decisions can lead to different conclusions even when the same data are analysed, a challenge known as inferential reproducibility. In linear regression analyses, incorrect handling of key assumptions, such as normality of the residuals and linearity, can undermine reproducibility. This study examines how violations of these assumptions influence inferential conclusions when the same data are reanalysed. Methods: We randomly sampled 95 health-related PLOS ONE papers from 2019 that reported linear regression in their methods. Data were available for 43 papers, and 20 were assessed for computational reproducibility, with three models per paper evaluated. The 14 papers that included a model at least partially computationally reproduced were then examined for inferential reproducibility. To assess the impact of assumption violations, differences in coefficients, 95% confidence intervals, and model fit were compared. Results: Of the fourteen papers assessed, only three were inferentially reproducible. The most frequently violated assumptions were normality and independence, each occurring in eight papers. Violations of independence were particularly consequential and were commonly associated with inferential failure. Although reproduced analyses often retained the same binary statistical significance classification as the original studies, confidence intervals were frequently wider, indicating greater uncertainty and reduced precision. Such uncertainty may affect the interpretation of results and, in turn, influence treatment decisions and clinical practice. Conclusion: Our findings demonstrate that substantial violations of key modelling assumptions often went undetected by authors and peer reviewers and, in many cases, were associated with inferential reproducibility failure. This highlights the need for stronger statistical education and greater transparency in modelling decisions. Rather than applying rigid or misinformed rules, such as incorrectly testing the normality of the outcome variable, researchers should adopt modelling frameworks guided by the research question and the study design. When assumptions are violated, appropriate alternatives, such as robust methods, bootstrapping, generalized linear models, or mixed-effects models, should be considered. Given that assumption violations were common even in relatively simple regression models, early and sustained collaboration with statisticians is critical for supporting robust, defensible, and clinically meaningful conclusions.

6

Attitudes and Perceptions Toward the Use of Artificial Intelligence Chatbots for Peer Review in Medical Journals: A Large-Scale, International Cross-Sectional Survey

Ng, J. Y.; Bhavsar, D.; Dhanvanthry, N.; Bouter, L.; Chan, T.; Cramer, H.; Flanagin, A.; Iorio, A.; Lokker, C.; Maisonneuve, H.; Marusic, A.; Moher, D.

2026-04-07 health informatics 10.64898/2026.04.07.26350263 medRxiv

Top 0.1%

2.4%

Show abstract

Background: Artificial intelligence chatbots (AICs), as a form of generative artificial intelligence (AI), are increasingly being considered for use in scholarly peer review to assist with tasks such as identifying methodological issues, verifying references, and improving language clarity. Despite these potential benefits, concerns remain regarding their reliability, ethical implications, and transparency. Evidence on how medical journal peer reviewers perceive the role and impact of AICs is limited. This study explored reviewers' familiarity with AICs, perceived benefits and challenges, ethical concerns, and anticipated future roles in peer review. Methods: We conducted a cross-sectional online survey of medical journal peer reviewers. Corresponding author information was extracted from MEDLINE-indexed articles added to PubMed within a two-month period using an R-based approach. A total of 72,851 authors were invited via email to participate; those who self-identified as peer reviewers were eligible. The 29-item survey assessed familiarity with AICs and perceptions of their benefits and limitations in peer review. The survey was administered via SurveyMonkey from April 28 to June 16, 2025, with two reminder emails sent during the data collection period. Results: A total of 1,260 respondents completed the survey. Most participants were familiar with AICs (86.2%) and had used tools such as ChatGPT for general purposes (87.7%), but the majority had not used AICs for peer review (70.3%). Most respondents reported that their institutions do not provide training on AIC use in peer review (69.5%), although many expressed interest in such training (60.7%). Perceptions of AIC benefits were mixed, while concerns were widely shared, particularly regarding potential algorithmic bias (80.3%) and issues related to trust and user acceptance (73.3%). Conclusions: While familiarity with AICs is high among medical journal peer reviewers, their use in peer review remains limited. There is clear interest in training and guidance, however, concerns related to ethics, data privacy, and research integrity persist and should be addressed before broader implementation.

7

Demystifying Clone-Censor-Weight Method in Target Trial Emulation: A Real-World Study of HPV Vaccination Strategies

Lin, T.; Li, Y.; Huang, Z.; Gui, T. T.; Wang, W.; Guo, Y.

2026-04-22 health informatics 10.64898/2026.04.21.26351413 medRxiv

Top 0.1%

2.1%

Show abstract

Target trial emulation (TTE) offers a principled way to estimate treatment effects using real-world observational data, but analyses of time-varying treatment strategies remain vulnerable to immortal time bias. The clone-censor-weight (CCW) approach is increasingly used to address this problem, yet key aspects of its causal interpretation and implementation remain unclear. In this work, we emulate a target trial using electronic health records (EHRs) to compare completion of a 3-dose 9-valent human papillomavirus vaccination (HPV) series within 12 months versus remaining partially vaccinated among vaccine initiators. We link CCW to the classic potential outcome framework in causal inference, evaluate the role of different weighting mechanisms, and account for within-subject correlation induced by cloning using cluster-robust variance estimation. Our study provides practical guidance for applying CCW in real-world comparative effectiveness studies to address immortal time bias and supports more rigorous and interpretable treatment effect estimation in TTE.

8

Simulation-Based Comparison of ControlledInterrupted Time Series (CITS) and Multivariable Regression

ORWA, F. O.; Mutai, C.; Nizeyimana, I.; Mwangi, A.

2026-04-13 health policy 10.64898/2026.04.10.26350670 medRxiv

Top 0.1%

2.1%

Show abstract

When randomized controlled trials are impractical, interrupted time series designs offer a rigorous quasi-experimental approach to assess population level policies. Indeed, in the context of quasi-experimental designs (QEDs), the Interrupted Time Series (ITS) method is commonly thought of as the most robust. But interrupted time series designs are susceptible to serial correlation and confounding by time-varying factors associated with both the intervention and the outcome, which may result in biased inference. Thus, we provide a simulation-based contrast of controlled interrupted time series (CITS) and multivariable regression (multivariable negative binomial regression) for estimation of policy effects in count time series data. These approaches are widely used in policy evaluations, yet their comparative performance in typical population health settings has rarely been examined directly. We tested both approaches within a variety of data generating situations, differing in the series length, intervention effect size, and magnitude of lag-1 autocorrelation. Bias, standard error calibration, confidence interval coverage, mean squared error, and statistical power were assessed for performance. Both methods gave unbiased estimates for moderate and large intervention effects, although bias was more pronounced for small effects, particularly in short series. Although the point estimate performance was similar, inferential properties varied significantly. CITS always had smaller mean squared error, better consistency between model based and empirical standard errors, and confidence interval coverage near the 95% nominal levels over weak to moderate autocorrelation. By contrast, multivariable regression was more sensitive to serial dependence, leading to underestimated standard errors and undercoverage, especially at moderate to high autocorrelation, regardless of Newey-West adjustments. These findings show the benefits of using a concurrent control series and the importance of structurally accounting for serial correlation when studying population level policies with time series data.

9

Causal estimands and target trials for the effect of lag time to treatment of cancer patients

Goncalves, B. P.; Franco, E. L.

2026-04-08 epidemiology 10.64898/2026.04.07.26350338 medRxiv

Top 0.2%

0.8%

Show abstract

Timeliness of therapy initiation is a fundamental determinant of outcomes for many medical conditions, most importantly, cancer. Yet, existing inefficiencies in healthcare systems mean that delays between diagnosis and treatment frequently adversely affect the clinical outcome for cancer patients. Although estimates of effects of lag time to therapy would be informative to policymakers considering resource allocation to minimize delays in oncology, causal methods are seldom explicitly discussed in epidemiologic analyses of these lag times. Here, we propose causal estimands for such studies, and outline the protocol of a target trial that could be emulated with observational data on lag times. To illustrate the application of this approach, we simulate studies of lag time to treatment under two scenarios: one in which indication bias (Waiting Time Paradox) is present and another in which it is absent. Although our discussion focuses on oncologic outcomes, components of the proposed target trial could be adapted to study delays for other medical conditions. We believe that the clarity with which causal questions are posed under the target trial emulation framework would lead to improved quantification of the effects of lag times in oncology, and hence to better informed policy decisions.

10

Educational Browser-Native SIR Simulation: Analytical Benchmarks Showing Numerical Accuracy for Lightweight Epidemic Modeling

Ben-Joseph, J.

2026-04-17 epidemiology 10.64898/2026.04.15.26350961 medRxiv

Top 0.2%

0.8%

Show abstract

Lightweight epidemic calculators are widely used for teaching and rapid scenario exploration, yet many omit the methodological detail needed for scientific reuse. We present a browser-native SIR calculator that exposes forward Euler and classical fourth-order Runge--Kutta (RK4) integration alongside epidemiologically interpretable outputs and a population-conservation diagnostic. The implementation is anchored to analytical properties of the deterministic SIR system, including the epidemic threshold, the peak condition, and the final-size relation. Benchmark experiments show that RK4 is essentially step-size invariant over practical discretizations, whereas Euler at a coarse one-day step overestimates peak prevalence by 3.97% and final size by 0.66% relative to a fine-step RK4 reference. These results demonstrate that browser-based tools can support publication-quality computational narratives when solver choice, diagnostics, and assumptions are treated as first-class outputs.

11

Assessing Compliance with Reporting Requirements in European Phase II to IV Clinical Trials: A Cross-Sectional Observational Study

Bruckner, T.; Dike, C. E.; Caquelin, L.; Freeman, A.; Aspromonti, D. A.; DeVito, N.; Song, Z.; Karam, G.; Nilsonne, G.

2026-04-05 health policy 10.64898/2026.04.03.26350111 medRxiv

Top 0.2%

0.5%

Show abstract

Objectives: To assess the availability of key clinical trial registration data and compliance with legal reporting requirements for all Phase 2-4 drug trials registered on the new European Clinical Trial Information System (CTIS) registry. This study is the first ever assessment of data quality and legal compliance with reporting requirements on CTIS. Design: Cross-sectional observational study of CTIS registry data combined with manual review of results documents. Setting: Cohort of all 7,547 Phase II-IV clinical trials registered on CTIS as of November 2025. Main outcome measures: Number and proportion of missing data points in CTIS registration data. Proportion of completed clinical trials that are compliant with regulatory reporting requirements. Results: Trial registration data quality was high overall with more than 99% of expected data present. Of 234 clinical trials legally required to report results, fewer than half (49.6%) fully reported results within the required timeframe, 20 trials (8.5%) fully reported results late, and 98 trials (41.9%) failed to fully report results. Legal compliance was similar for adult trials (79/158) and paediatric trials (37/76). Conclusions: Sponsor compliance with legal reporting requirements is weak. Current efforts by European regulators to monitor and enforce compliance appear to be insufficient. New results reporting functions currently being set up by trial registries worldwide will require quality assurance processes. Trial registration: Study protocol prospectively registered on OSF: https://osf.io/sn4j2/overview

12

An adversarial approach to guide the selection of preprocessing pipelines for ERP studies

Scanzi, D.; Taylor, D. A.; McNair, K. A.; King, R. O. C.; Braddock, C.; Corballis, P. M.

2026-03-30 neuroscience 10.64898/2026.03.26.714586 medRxiv

Top 0.3%

0.5%

Show abstract

Electroencephalography (EEG) data are inherently contaminated by non-neuronal noise, including eye movements, muscle activity, cardiac signals, electrical interference, and technical issues such as poorly connected electrodes. Preprocessing to remove these artefacts is essential, yet the optimal method remains unclear due to the vast number of available techniques, their combinatorial use in pipelines, and adjustable parameters. Consequently, most studies adopt ad hoc preprocessing strategies based on dataset characteristics, study goals, and researcher expertise, with little justification for their choices. Such variability can influence downstream results, potentially determining whether effects are detected, and introduces risks of questionable analytical practices. Here, we present a method to objectively evaluate and compare preprocessing pipelines. Our approach uses realistically simulated signals injected into real EEG data as "ground truth", enabling the assessment of a pipelines ability to remove noise without distorting neuronal signals. This evaluation is independent of the studys main analyses, ensuring that pipeline selection does not bias results. By applying this procedure, researchers can select preprocessing strategies that maximize signal-to-noise ratio while maintaining the integrity of the neural signal, improving both reproducibility and interpretability of EEG studies. Although the data presented here focuses on processing and analysis most relevant for ERP research, the method can be flexibly expanded to other types of analyses or signals.

13

TernTables: A Statistical Analysis and Table Generation Web Interface for Clinical and Biomedical Research

Preston, J. D.; Abadiotakis, H.; Tang, A.; Rust, C. J.; Halkos, M. E.; Daneshmand, M. A.; Chan, J. L.

2026-04-20 bioinformatics 10.64898/2026.04.15.717241 medRxiv

Top 0.3%

0.5%

Show abstract

Clinical research dissemination is frequently hindered by administrative friction and methodological inconsistency. To address these barriers, we developed TernTables, a freely available, open-source web application (https://www.tern-tables.com/) and R package (https://cran.r-project.org/package=TernTables) that streamlines the transition from raw data to formatted results for descriptive and univariate clinical reporting. The system integrates a client-side screening protocol for protected health information (PHI) with a rule-based decision tree that selects and executes appropriate frequency-based, parametric, or non-parametric statistical tests based on data distribution and class. TernTables generates publication-ready summary tables in Microsoft Word format, complemented by dynamically generated methods text and the underlying R code to ensure complete transparency and reproducibility. Validation using a landmark clinical trial dataset demonstrated concordance with established biostatistical approaches for descriptive and univariate analyses. TernTables is designed to supplement, not replace, formal statistical consultation by standardizing routine descriptive and univariate workflows, allowing biostatistical expertise to be focused on complex analyses and study design. By lowering technical and financial barriers, the platform democratizes access to rigorous statistical workflows while maintaining methodological excellence and reducing "researcher degrees of freedom."

14

Covariate adjustment for hierarchical outcomes and the win ratio: how to do it and is it worthwhile?

Hazewinkel, A.-D.; Gregson, J.; Bartlett, J. W.; Gasparyan, S. B.; Wright, D.; Pocock, S.

2026-03-31 cardiovascular medicine 10.64898/2026.03.30.26347966 medRxiv

Top 0.3%

0.3%

Show abstract

Objectives: Introducing a new covariate adjustment method for hierarchical outcomes using ordinal logistic regression, comparing it with existing approaches, and assessing whether adjustment improves power in randomized trials with hierarchical outcomes. Methods: We developed an ordinal regression-based method for covariate adjustment of the win ratio and compared it with three alternatives: probability index models, inverse probability weighting, and a randomization-based estimator. Methods were applied to the EMPEROR-Preserved rial and tested through extensive simulations involving two common hierarchical outcome structures: time-to-event composites, and composites combining time-to-event with quantitative measures. Simulations assessed impacts on estimates, standard errors, and power across prognostic and non-prognostic settings. Results: In RCT data and simulations, covariate adjustment consistently increased power when adjusting for prognostic baseline variables. Gains were comparable to or greater than those in conventional Cox models, with no power loss for non-prognostic covariates. Our ordinal approach performed similarly to existing methods while providing interpretable covariate effect estimates. Adjusting for baseline values of quantitative components yielded power gains according to the baseline-to-follow-up correlation. Conclusions: Covariate adjustment for prognostic variables meaningfully improves efficiency in win ratio analyses for hierarchical outcomes. Our ordinal method is easily implemented and facilitates covariate effect interpretation. We recommend the broader adoption of covariate adjustment and our ordinal method in randomized trials using hierarchical outcomes.

15

Democratizing Scientific Publishing: A Local, Multi-Agent LLM Framework for Objective Manuscript Editing

Bhansali, R.; Gorenshtein, A.; Westover, B.; Goldenholz, D. M.

2026-04-17 health informatics 10.64898/2026.04.13.26350761 medRxiv

Top 0.3%

0.3%

Show abstract

Manuscript preparation is a critical bottleneck in scientific publishing, yet existing AI writing tools require cloud transmission of sensitive content, creating data-confidentiality barriers for clinical researchers. We introduce the Paper Analysis Tool (PAT), a free, multi-agent framework that deploys 31 specialized agents powered by small language models (SLMs) to audit manuscripts across multiple quality dimensions without external data transmission. Applied to three published clinical neurological papers, PAT generated 540 evaluable suggestions. Validation by two expert reviewers (R.B., A.G.) confirmed 391 actionable, high-value revisions (90% agreement), achieving a 72.4% overall usefulness accuracy spanning methodological, statistical, and visual domains. Furthermore, deterministic re-evaluation of 126 agent-suggested rewrite pairs using Phase 0 metrics confirmed text improvement: total word count decreased by 25%, passive voice prevalence dropped sharply from 35% to 5%, average sentence length decreased by 24%, long-sentence fraction fell by 67%, and the Flesch-Kincaid grade improved by 17% . Our validation confirms that systematic, agent-driven pre-submission review drives measurable improvements, successfully converting manuscript optimization from an opaque, manual endeavor into a transparent and rigorous scientific process. Manuscript preparation is a critical bottleneck in scientific publishing, yet existing AI writing tools require cloud transmission of sensitive content, creating data-confidentiality barriers for clinical researchers. We introduce the Paper Analysis Tool (PAT), a free, multi-agent framework that deploys 31 specialized agents powered by small language models (SLMs) to audit manuscripts across multiple quality dimensions without external data transmission. Applied to three published clinical neurological papers, PAT generated 540 evaluable suggestions. Independent validation by two expert reviewers (R.B., A.G.) confirmed 391 actionable, high-value revisions (90% agreement), achieving a 72.4% overall usefulness accuracy spanning methodological, statistical, and visual domains. Furthermore, deterministic re-evaluation of 126 suggested Phase 0 rewrite pairs confirmed text improvement: total word count decreased by 25%, passive voice prevalence dropped sharply from 35% to 5%, average sentence length decreased by 24%, and long-sentence fraction fell by 67%, and the Flesch-Kincaid grade improved modestly. Our validation confirms that systematic, agent-driven pre-submission review drives measurable improvements, successfully converting manuscript optimization from an opaque, manual endeavor into a transparent and rigorous scientific process.

16

Quantifying Scientific Consensus in Biomedical Hypotheses via LLM-Assisted Literature Screening

Kim, U.; Kwon, O.; Lee, D.

2026-04-09 bioinformatics 10.64898/2026.04.06.716861 medRxiv

Top 0.4%

0.2%

Show abstract

Systematic literature reviews are labor-intensive tasks in biomedical research. While Large Language Models (LLMs) using Retrieval-Augmented Generation (RAG) techniques have enhanced information accessibility, the inherent complexity of biological systems--characterized by high context dependency and conflicting data--remains a primary driver of LLM hallucinations. This imposes a structural constraint that limits the precision of evidence synthesis. To address these limitations, we propose an automated framework designed for the exhaustive identification of supporting and contradictory evidence within a target literature set. Rather than relying on a models pre-trained knowledge, our system requires the LLM to review each paper individually to determine its alignment with a specific research hypothesis. By evaluating semantic context, the framework captures subtle contradictions that are often overgeneralized by conventional methods. The frameworks performance was validated using the BioNLI task, where it demonstrated high classification accuracy in distinguishing whether evidence supports or contradicts a given hypothesis. Notably, the implementation of an ensemble approach provided superior stability and slightly higher precision compared to individual models. Furthermore, the framework exhibited robust performance across several well-established biological hypotheses, confirming its practical utility and reliability in real-world research. This approach provides a rigorous basis for biomedical discovery by enabling the precise, systematic analysis of biological literature and the robust collection of evidence.

17

Citation Hallucination Determines Success: An Empirical Comparison of Six Medical AI Research Systems

Shi, X.; Tian, Z.; Tan, S.; Wang, X.

2026-04-04 health informatics 10.64898/2026.04.02.26350091 medRxiv

Top 0.4%

0.2%

Show abstract

Large language model (LLM) systems can now generate complete research manuscripts, yet their reliability in clinical medicine - where citation accuracy and reporting standards carry direct consequences - has not been systematically assessed. We introduce MedResearchBench, a benchmark of three clinical epidemiology tasks built on NHANES data, and use it to evaluate six AI research systems across six quality dimensions. Evaluation combines programmatic citation verification, rule-based reporting compliance checks, and multi-model LLM judging, providing a more discriminative assessment than conventional single-judge approaches. Citation integrity emerged as the decisive quality dimension. Hallucination rates ranged from 2.9% to 36.8% across systems, and a hard-rule threshold on per-task citation scores capped four of six systems' total scores at the penalty ceiling. Adding a multi-agent citation verification and repair pipeline to the best-performing system improved its citation integrity score from 40.0 to 90.9 and raised the weighted total from 68.9 to 81.8. Strikingly, a single-model evaluation ranked this system last (55.5), while our three-tier framework ranked it first (81.8) - a complete reversal that exposes the limitations of subjective LLM-only evaluation. These results suggest that programmatic citation verification should be a core metric in future evaluations of AI scientific writing systems, and that multi-agent quality assurance can bridge the gap between fluent text generation and trustworthy scholarship.

18

Benchmarking Heritability Estimation Strategies Across 86 Configurations and Their Downstream Effect on Polygenic Risk Score Performance

Muneeb, M.; Ascher, D.

2026-04-02 bioinformatics 10.64898/2026.04.02.716079 medRxiv

Top 0.4%

0.2%

Show abstract

ObjectiveSNP heritability estimates vary substantially across estimation strategies, yet the downstream consequences for polygenic risk score (PRS) construction remain poorly characterised. We systematically benchmarked heritability estimation configurations and assessed their propagation into downstream PRS performance. MethodsWe benchmarked 86 heritability-estimation configurations spanning six tool families (GEMMA, GCTA, LDAK, DPR, LDSC, SumHer) and ten method groups across 10 UK Biobank phenotypes, yielding 844 configuration-level estimates. Each estimate was propagated into GCTA-SBLUP and LDpred2-lassosum2 PRS frameworks and evaluated across five cross-validation folds using null, PRS-only, and full models. Eleven binary analytical contrasts were tested using Mann-Whitney U tests to identify drivers of heritability variability. ResultsHeritability ranged from -0.862 to 2.735 (mean = 0.134, SD = 0.284), with 133 of 844 estimates (15.8%) negative and concentrated in unconstrained estimation regimes. Ten of eleven analytical contrasts significantly affected heritability magnitude, with algorithm choice and GRM standardisation showing the largest effects. Despite this upstream variability, downstream PRS test performance was only weakly coupled to heritability magnitude: pooled Pearson correlations between h2 and test AUC were r = -0.023 for GCTA-SBLUP and r = +0.014 for LDpred2-lassosum2 (both non-significant). ConclusionSNP heritability is best interpreted as a configuration-sensitive modelling parameter rather than a universally stable scalar input. Heritability estimates should always be reported alongside their full estimation specification, and downstream PRS performance is comparatively robust to moderate variation in the heritability input. Graphical Abstract O_FIG O_LINKSMALLFIG WIDTH=200 HEIGHT=80 SRC="FIGDIR/small/716079v1_ufig1.gif" ALT="Figure 1"> View larger version (27K): org.highwire.dtl.DTLVardef@112929borg.highwire.dtl.DTLVardef@573c36org.highwire.dtl.DTLVardef@132170borg.highwire.dtl.DTLVardef@1871363_HPS_FORMAT_FIGEXP M_FIG C_FIG

19

Track Display Jockey (trackDJ): a user-friendly R package for visualization of epigenomic data

Bokil, N. V.; Page, D. C.

2026-04-16 bioinformatics 10.64898/2026.04.15.718328 medRxiv

Top 0.4%

0.2%

Show abstract

BackgroundVisualization of epigenomic data such as coverage tracks, peak calls, and chromatin interactions is a critical task in genomic data analysis. Although genome browsers such as the Integrative Genomics Viewer (IGV) and the UCSC Genome Browser permit user-friendly exploration of genomic tracks, they are not optimized for fully programmatic and reproducible generation of publication-quality figures. In contrast, existing programmatic tools lack a user-friendly interface and require extensive configuration. ResultsWe present trackDJ (Track Display Jockey), an R package for visualization of epigenomic data. trackDJ prioritizes usability by favoring convention over configuration; it provides high-level plotting functions with sensible defaults, allowing users with minimal programming experience to generate clear, publication-quality figures with relatively little coding. Within a unified plotting framework, users can stack and align multiple data types, including coverage tracks, peak annotations, chromatin loops, and gene annotations. trackDJ allows users to select plotted genomic regions by coordinates or by gene name, enabling rapid visualization without knowledge of precise locus boundaries. ConclusionstrackDJ provides a user-friendly and reproducible alternative to interactive genome browsers for epigenomic visualization, filling a critical gap in currently available epigenomics toolkits. By enabling scripted generation of clean, customizable genomic illustrations, trackDJ integrates naturally into R-based analysis workflows to streamline the creation of publication-quality figures.

20

Understanding unexpected results from randomized clini{square}cal trials Does coffee reduce atrial fibrillation recurrences?

Brophy, J. M.

2026-04-17 cardiovascular medicine 10.64898/2026.04.13.26350787 medRxiv

Top 0.5%

0.2%

Show abstract

ObjectiveTo explore the interpretation of unexpected results from a randomized controlled trial (RCT). Study Design and SettingAdjunctive frequentist (power and type{square}M error) and Bayesian analyses were performed on a recently published RCT reporting a statistically significant relative risk reduction (p <0.01) for caffeinated coffee drinkers compared with abstinence on atrial fibrillation (AF) recurrence. Individual patient data for the Bayesian survival models were reconstructed from the RCT published material and priors informed by the RCT power calculations. ResultsThe original RCT design had limited power for realistic effect sizes, increasing susceptibility to type{square}M (magnitude) error. Bayesian analyses also tempered the benefit for caffeinated coffee implied by standard statistical analysis resulting in only modest probabilities of clinically meaningful risk reductions (e.g., hazard ratio < 0.9 of 88% or a risk difference > 2% of 82%). ConclusionsSupplemental frequentist and Bayesian approaches can provide robustness checks for unexpected RCT findings, providing contextualization, clarifying distinctions between statistical and clinical significance, and guiding replication needs. HighlightsO_LIRandomized controlled trial (RCT) results may be unexpected and challenge prior beliefs C_LIO_LISupplemental frequentist and Bayesian analyses can clarify interpretation of surprising findings C_LIO_LIPower and type{square}M error assessments help evaluate design adequacy for realistic effects C_LIO_LIBayesian posterior probabilities provide additional nuanced insights into contextulaization and clinical significance C_LI